Abstract

Author name disambiguation in bibliographic databases is the problem of
grouping together scientific publications written by the same person,
accounting for potential homonyms and/or synonyms. Among solutions to
this problem, digital libraries are increasingly offering tools for authors to
manually curate their publications and claim those that are theirs.
Indirectly, these tools allow for the inexpensive collection of large annotated
training data, which can be further leveraged to build a complementary
automated disambiguation system capable of inferring patterns for identifying
publications written by the same person. Building on more than 1 million
publicly released crowdsourced annotations, we propose an automated
author disambiguation solution exploiting this data (i) to learn an accurate
classifier for identifying coreferring authors and (ii) to guide the clustering
of scientific publications by distinct authors in a semi-supervised way. To
the best of our knowledge, our analysis is the first to be carried out on data
of this size and coverage. With respect to the state of the art, we validate
the general pipeline used in most existing solutions, and improve by: (i)
proposing phonetic-based blocking strategies, thereby increasing recall; and
(ii) adding strong ethnicity-sensitive features for learning a linkage function,
thereby tailoring disambiguation to non-Western author names whenever
necessary.


1 Introduction 


In academic digital libraries, author name disambiguation is the problem of grouping
together publications written by the same person. Author name disambiguation is often a
difficult problem because an author may use different spellings or name variants across their
career (synonymy) and/or distinct authors may share the same name (polysemy). Most
notably, author disambiguation is often more troublesome for researchers from non-Western
cultures, where personal names may be traditionally less diverse (leading to homonym
issues) or for which transliteration to Latin characters may not be unique (leading to synonym
issues). With the fast growth of the scientific literature, author disambiguation has become
a pressing issue since the accuracy of information managed at the level of individuals directly
affects: the relevance of search results (e.g., when querying for all publications written by a
given author); the reliability of bibliometrics and author rankings (e.g., citation counts or
other impact metrics, as studied in (Strotmann and Zhao, 2012)); and/or the relevance of
scientific network analysis (Newman, 2001).


Efforts and solutions to author disambiguation have been proposed from various
communities (Liu et al., 2014). On the one hand, libraries have maintained authorship control
through manual curation, either in a centralized way by hiring professional collaborators
or through developing services that invite authors to register their publications themselves
(e.g., Google Scholar or Inspire-HEP). Recent efforts to create persistent digital identifiers
assigned to researchers (e.g., ORCID or ResearcherID), with the objective to embed these
identifiers in the submission workflow of publishers or repositories (e.g., Elsevier, arXiv
or Inspire-HEP), would univocally solve any disambiguation issue. With the large cost
of centralized manual authorship control, or until crowdsourced solutions are more widely
adopted, the impact of these efforts is unfortunately limited by the efficiency, motivation
and integrity of their active contributors. Similarly, the success of persistent digital identifier
efforts is conditioned to a large and ubiquitous adoption by both researchers and
publishers. For these reasons, fully automated machine learning-based methods have been proposed
during the past decade to provide immediate, less costly, and satisfactory solutions to author
disambiguation. In this work, our goal is to explore and demonstrate how both approaches
can coexist and benefit from each other. In particular, we study how labeled data obtained
through manual curation (either centralized or crowdsourced) can be exploited (i) to learn
an accurate classifier for identifying coreferring authors, and (ii) to guide the clustering of
scientific publications by distinct authors in a semi-supervised way. Our analysis of
parameters and features of this large dataset reveals that the general pipeline commonly used
in existing solutions is an effective approach for author disambiguation. Additionally, we
propose alternative strategies for blocking based on the phonetization of author names to
increase recall. We also propose ethnicity-sensitive features for learning a linkage function,
thereby tailoring author disambiguation to non-Western author names whenever necessary.

The remainder of this report is structured as follows. In Section 2, we briefly review machine
learning solutions for author disambiguation. The components of our method are then
defined in Section 3 and its implementation described in Section 4. Experiments are carried
out in Section 5, where we explore and validate features for the supervised learning of a
linkage function and compare strategies for the semi-supervised clustering of publications.
Finally, conclusions and future works are discussed in Section 6.


2 Related work 


As reviewed in (Smalheiser and Torvik, 2009; Ferreira et al., 2012; Levin et al., 2012), author
disambiguation algorithms are usually composed of two main components: (i) a linkage
function determining whether two publications have been written by the same author; and
(ii) a clustering algorithm producing clusters of publications assumed to be written by the
same author. Approaches can be classified along several axes, depending on the type and
amount of data available, the way the linkage function is learned or defined, or the clustering
procedure used to group publications. Methods relying on supervised learning usually make
use of a small set of hand-labeled pairs of publications identified as being either from the
same or from different authors, from which a linkage function is learned (Han et al., 2004;
Huang et al., 2006; Culotta et al., 2007; Treeratpituk and Giles, 2009; Tran et al., 2014).


Training data is usually not easily available, therefore unsupervised approaches propose
the use of domain-specific, manually designed, linkage functions tailored towards author
disambiguation (Malin, 2005; McRae-Spencer and Shadbolt, 2006; Song et al., 2007; Soler,
2007; Kang et al., 2009; Fan et al., 2011; Schulz et al., 2014). These approaches have
the advantage of not requiring hand-labeled data, but generally do not perform as well
as supervised approaches. To reconcile both worlds, semi-supervised methods make use of
small, manually verified, clusters of publications and/or high-precision domain-specific rules
to build a training set of pairs of publications, from which a linkage function is then built
using supervised learning (Ferreira et al., 2010; Torvik and Smalheiser, 2009; Levin et al.,
2012). Semi-supervised approaches also allow for the tuning of the clustering algorithm when
the latter is applied to a mixed set of labeled and unlabeled publications, e.g., by maximizing
some clustering performance metric on the known clusters (Levin et al., 2012).


Due to the lack of large and publicly available datasets of curated clusters of publications,
studies on author disambiguation are usually constrained to validating their results on
manually built datasets of limited size and scope (from a few hundred to a few thousand papers,
with sparse coverage of ambiguous cases), making the true performance of these methods
often difficult to assess with high confidence. Additionally, despite devoted efforts to construct
them, these datasets are rarely public, making it even more difficult to compare methods
using a common benchmark.

    Signature for "Doe, John"

    Title         Lorem ipsum dolor sit amet, consectetur adipiscing elit
    Author        Doe, John
    Affiliation   University of Foo
    Co-authors    Smith, John; Chen, Wang
    Year          2015

Figure 1: An example signature s for "Doe, John". A signature is defined as a unique piece
of information identifying an author on a publication, along with any other metadata that
can be derived from it, such as publication title, co-authors or date of publication.

In this context, we position the work in this paper as a semi-supervised solution for author
disambiguation, with the significant advantage of having a very large collection of more than
1 million crowdsourced annotations of publications whose true authors are identified. The
extent and coverage of this data allow us to revisit, validate and nuance previous findings
regarding supervised learning of linkage functions, and to better explore strategies for
semi-supervised clustering. Furthermore, by releasing our data in the public domain, we hope
to provide a benchmark on which further research on author disambiguation and related
topics can be built.


3 Semi-supervised author disambiguation 


Formally, let us assume a set of publications $\mathcal{P} = \{p_0, ..., p_{N-1}\}$ along with
the set of unique individuals $\mathcal{A} = \{a_0, ..., a_{M-1}\}$ having together authored all
publications in $\mathcal{P}$. Let us define a signature $s \in p$ from a publication as a unique
piece of information identifying one of the authors of $p$ (e.g., the author name, their
affiliation, along with any other metadata that can be derived from $p$, as illustrated in
Figure 1). Let us denote by $\mathcal{S} = \{s \mid s \in p, p \in \mathcal{P}\}$ the set
of all signatures that can be extracted from all publications in $\mathcal{P}$.

In this framework, author disambiguation can be stated as the problem of finding a partition
$\mathcal{C} = \{c_0, ..., c_{M-1}\}$ of $\mathcal{S}$ such that
$\mathcal{S} = \cup_{i=0}^{M-1} c_i$, $c_i \cap c_j = \emptyset$ for all $i \neq j$, and where
subsets $c_i$, or clusters, each correspond to the set of all signatures belonging to the same
individual $a_i$. Alternatively, the set $\mathcal{A}$ may remain (possibly partially) unknown,
such that author disambiguation boils down to finding a partition $\mathcal{C}$ where subsets
$c_i$ each correspond to the set of all signatures from the same individual (without knowing
who). Finally, in the case of partially annotated databases as studied in this work, the setting
extends with the partial knowledge $\mathcal{C}' = \{c'_0, ..., c'_{M-1}\}$ of $\mathcal{C}$,
such that $c'_i \subseteq c_i$, where $c'_i$ may be empty. Or put otherwise, the setting extends
with the assumption that all signatures $s \in c'_i$ belong to the same author.

Inspired by several previous works described in Section 2, we cast in this work author
disambiguation into a semi-supervised clustering problem. Our algorithm is composed of
three parts, as sketched in Figure 2: (i) a blocking scheme whose goal is to roughly
pre-cluster signatures $\mathcal{S}$ into smaller groups in order to reduce computational
complexity; (ii) the construction of a linkage function $d$ between signatures using supervised
learning; and (iii) the semi-supervised clustering of all signatures within the same block,
using $d$ as a pseudo distance metric.

Figure 2: Pipeline for author disambiguation: (a) signatures are blocked to reduce
computational complexity, (b) a linkage function is built with supervised learning, (c) independently
within each block, signatures are grouped using hierarchical agglomerative clustering.


3.1 Blocking 


As in previous works, the first part of our algorithm consists of dividing signatures
$\mathcal{S}$ into disjoint subsets $\mathcal{S}_b$, or blocks (Fellegi and Sunter, 1969),
followed by carrying out author disambiguation on each one of these blocks independently.
By doing so, the computational complexity of clustering (see Section 3.3) typically reduces
from $O(|\mathcal{S}|^2)$ to $O(\sum_b |\mathcal{S}_b|^2)$, which is much more tractable as the
number of signatures increases. Since disambiguation is performed independently per block, a
good blocking strategy should be designed such that signatures from the same author are all
mapped to the same block, otherwise their correct clustering would not be possible in later
stages of the workflow. As a result, blocking should be a balance between reduced complexity
and maximum recall.
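To make the complexity argument concrete, the following minimal sketch counts the number of signature pairs a clustering step would have to consider with and without blocking. The function names and the toy block keys are illustrative assumptions, not part of the released implementation.

```python
from collections import Counter


def n_pairs(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


def pairwise_cost(block_keys):
    """Return (pairs without blocking, pairs with blocking).

    block_keys holds one (hypothetical) block key per signature.
    """
    block_sizes = Counter(block_keys).values()
    return n_pairs(len(block_keys)), sum(n_pairs(size) for size in block_sizes)


# Six toy signatures falling into three blocks of sizes 3, 2 and 1.
toy_keys = ["Doe, J", "Doe, J", "Doe, J", "Smith, A", "Smith, A", "Chen, W"]
```

On the toy keys, the unblocked count grows quadratically with the total number of signatures, while the blocked count only sums the quadratic cost of each block.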


The simplest and most common strategy for blocking, hereafter referred to as Surname and
First Initial (SFI), groups signatures together if they share the same surname(s) and the
same first given name initial (e.g., SFI("Doe, John") == "Doe, J"). Despite satisfactory
performance, there are several cases where this simple strategy fails to place related pairs
of signatures in the same block, including:


1. There are different ways of writing an author name, or signatures contain a typo
(e.g., "Mueller, R." and "Muller, R.", "Tchaikovsky, P." and "Czajkowski, P.").

2. An author has multiple surnames and some signatures place the first part of the
surname within the given names (e.g., "Martinez Torres, A." and "Torres, A.
Martinez").

3. An author has multiple surnames and, on some signatures, only the first surname
is present (e.g., "Smith-Jones, A." and "Smith, A.").

4. An author has multiple given names and they are not always all recorded (e.g.,
"Smith, Jack" and "Smith, A. J.").

5. An author's surname changed (e.g., due to marriage).
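The SFI key and two of the failure cases listed above can be illustrated with a short sketch. It assumes names are given as "Surname, Given names" strings; the function name `sfi_key` is hypothetical and the released implementation may differ.

```python
def sfi_key(name):
    """Surname and First Initial key for a 'Surname, Given names' string."""
    surname, _, given = name.partition(",")
    given = given.strip()
    initial = given[0].upper() if given else ""
    return f"{surname.strip()}, {initial}"
```

With this key, "Mueller, R." and "Muller, R." (case 1) and "Smith, Jack" and "Smith, A. J." (case 4) end up in different blocks, so no later clustering step can bring them back together.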


To account for these issues we propose instead to block signatures based on the phonetic
representation of the normalized surname. Normalization involves stripping accents (e.g.,
"Jabłoński, Ł" → "Jablonski, L") and name affixes that inconsistently appear in signatures
(e.g., "van der Waals, J. D." → "Waals, J. D."), while phonetization is based either on the
Double Metaphone (Philips, 2000), the NYSIIS (Taft, 1970) or the Soundex (The National
Archives, 2007) phonetic algorithms for mapping author names to their pronunciations.
Together, these processing steps allow for grouping of most name variants of the same
person in the same block with a small increase in the overall computational complexity,
thereby solving case 1.
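As a rough illustration of normalization followed by phonetization, the sketch below strips combining accents and a few affixes, then applies a plain American Soundex. The affix list, the function names and the choice of this particular Soundex variant are assumptions made for the example; Double Metaphone and NYSIIS are not shown, and letters without a Unicode decomposition (such as "ł") would need an explicit mapping that is omitted here.

```python
import unicodedata

# Illustrative affix list (an assumption, not the list used by the authors).
AFFIXES = {"van", "der", "den", "de", "von", "la", "le"}

# Soundex digit for each coded consonant; vowels, h, w and y are not coded.
CODES = {
    letter: digit
    for digit, letters in (("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                           ("4", "l"), ("5", "mn"), ("6", "r"))
    for letter in letters
}


def normalize_surname(surname):
    """Lowercase, strip combining accents and drop affix tokens."""
    decomposed = unicodedata.normalize("NFKD", surname)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(t for t in plain.lower().split() if t not in AFFIXES)


def soundex(word):
    """American Soundex code (one letter followed by three digits)."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated codes
        digit = CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # a vowel resets the previous code
    return (code + "000")[:4]


def phonetic_block_key(surname):
    return soundex(normalize_surname(surname))
```

Both "Mueller" and "Müller" map to the same key here, whereas the SFI key would separate them.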
In the case of multiple surnames (cases 2 and 3), we propose to block signatures in two
phases. In the first phase, all the signatures with a single surname are clustered together.
Every different surname token creates a new block. In the second phase, the signatures with
multiple surnames are compared with the blocks for the first and last surname. If the first
surnames of an author were already used as the last given names on some of the signatures,
the new signature is assigned to the block of the last surname (case 2). Otherwise, the
signature is assigned to the block of the first surname (case 3). Finally, to prevent the
creation of overly large blocks, signatures are further divided along their first given name
initial. Cases 4 and 5 are not explicitly handled.
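One possible reading of the two-phase procedure is sketched below. It assumes each signature has already been split into a tuple of normalized surname tokens and a tuple of given-name tokens; the data layout and the function name are hypothetical, and the final split by first given name initial is left out for brevity.

```python
from collections import defaultdict


def two_phase_blocks(signatures):
    """Block (surname_tokens, given_tokens) tuples by surname in two phases."""
    blocks = defaultdict(list)

    # Phase 1: every distinct single surname opens its own block.
    for signature in signatures:
        surnames, _ = signature
        if len(surnames) == 1:
            blocks[surnames[0]].append(signature)

    # Phase 2: place multi-surname signatures in the block of the last
    # surname if the leading surnames already appear as trailing given
    # names there (case 2), otherwise in the block of the first surname
    # (case 3).
    for signature in signatures:
        surnames, _ = signature
        if len(surnames) > 1:
            leading, last = surnames[:-1], surnames[-1]
            used_as_given = any(
                given[-len(leading):] == leading
                for _, given in blocks.get(last, [])
            )
            blocks[last if used_as_given else surnames[0]].append(signature)

    return dict(blocks)
```

For instance, "Martinez Torres, A." joins the block of "Torres, A. Martinez", while "Smith-Jones, A." joins the block of "Smith, A.".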


3.2 Linkage function 


Supervised classification. The second part of the algorithm is the automatic construction
of a pair-wise linkage function between signatures for use during the clustering step which
groups all signatures from the same author.

Formally, the goal is to build a function $d : \mathcal{S} \times \mathcal{S} \mapsto [0, 1]$,
such that $d(s_1, s_2)$ approaches 0 if both signatures $s_1$ and $s_2$ belong to the same
author, and 1 otherwise. This problem can be cast as a standard supervised classification
task, where inputs are pairs of signatures and outputs are classes 0 (same author) and 1
(distinct authors). In this work, we evaluate Random Forests (RF, Breiman (2001)), Gradient
Boosted Regression Trees (GBRT, Friedman (2001)), and Logistic Regression (Fan et al., 2008)
as classifiers.


Input features. In most cases, supervised learning algorithms assume the input space
$\mathcal{X}$ to be numeric (e.g., $\mathbb{R}^p$), making them not directly applicable to
structured input spaces such as $\mathcal{S} \times \mathcal{S}$. Following previous works,
pairs of signatures $(s_1, s_2)$ are first transformed to vectors $v \in \mathbb{R}^p$ by
building so-called similarity profiles (Treeratpituk and Giles, 2009) on which supervised
learning is carried out. In this work, we design and evaluate fifteen standard input features
based on the comparison of signature fields, as reported in the first half of Table 1. As an
illustrative example, the Full name feature corresponds to the similarity between the (full)
author name fields of the two signatures, as measured using as combination operator the
cosine similarity between their respective (n, m)-TF-IDF vector representations^1. Similarly,
the Year difference feature measures the absolute difference between the publication dates
of the articles to which the two signatures respectively belong.
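The two features described above can be sketched in plain Python as follows. The smoothed IDF formula, the dictionary-based signature layout and all function names are assumptions for illustration; the actual feature extraction may be implemented differently.

```python
import math
from collections import Counter


def char_ngrams(text, n, m):
    """All character k-grams of text for k = n, n+1, ..., m."""
    text = text.lower()
    return [text[i:i + k] for k in range(n, m + 1)
            for i in range(len(text) - k + 1)]


def fit_idf(corpus, n, m):
    """Smoothed inverse document frequency of each character n-gram."""
    df = Counter(g for doc in corpus for g in set(char_ngrams(doc, n, m)))
    return {g: math.log((1 + len(corpus)) / (1 + count)) + 1.0
            for g, count in df.items()}


def tfidf(text, idf, n, m):
    """Sparse TF-IDF vector; n-grams unseen when fitting are ignored."""
    tf = Counter(char_ngrams(text, n, m))
    return {g: count * idf[g] for g, count in tf.items() if g in idf}


def cosine(u, v):
    dot = sum(weight * v.get(g, 0.0) for g, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_profile(s1, s2, idf):
    """Two features: (2,4)-TF-IDF full-name similarity and year difference."""
    name_similarity = cosine(tfidf(s1["name"], idf, 2, 4),
                             tfidf(s2["name"], idf, 2, 4))
    return [name_similarity, abs(s1["year"] - s2["year"])]
```

A full similarity profile would simply concatenate one such value per feature; only two features are shown to keep the sketch short.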

Author names from different cultures, origins or ethnic groups are likely to be disambiguated
using different strategies (e.g., pairs of signatures with French author names versus pairs of
signatures with Chinese author names) (Treeratpituk and Giles, 2012; Chin et al., 2014).
To support our disambiguation algorithm, we added seven features to our feature set, with
each evaluating the degree of belonging of both signatures to an ethnic group, as reported
in the second half of Table 1.


More specifically, using census data extracted from (Ruggles et al., 2008), we build a support
vector machine classifier (using a linear kernel and one-versus-all classification scheme) for
mapping the (1,5)-TF-IDF representation of an author name to one of the seven ethnic
groups. Given a pair of signatures $(s_1, s_2)$, the proposed ethnicity features are each
computed as the estimated probability of $s_1$ belonging to the corresponding ethnic group,
multiplied by the estimated probability of $s_2$ belonging to the same group. In doing so, the
expectation is for the linkage function to become sensitive to the actual origin of the authors
depending on the values of these features. Indirectly, let us also note that these features hold
discriminative power since if author names are predicted to belong to different ethnic groups,
then they are also likely to correspond to distinct people.
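The product-of-probabilities operator is simple enough to write down directly. In the sketch below the per-group probabilities are assumed to come from some name classifier that is not shown, and the numbers used to exercise it are made up for illustration only.

```python
ETHNIC_GROUPS = [
    "White",
    "Black",
    "American Indian or Alaska Native",
    "Chinese",
    "Japanese",
    "Other Asian or Pacific Islander",
    "Others",
]


def ethnicity_features(proba_s1, proba_s2):
    """One feature per group: P(s1 in group) * P(s2 in group).

    proba_s1 and proba_s2 map each group name to an estimated probability.
    """
    return [proba_s1[group] * proba_s2[group] for group in ETHNIC_GROUPS]
```

A feature is large only when both names are confidently assigned to the same group, which is what lets a tree-based classifier apply group-specific rules.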

Building a training set. The distinctive aspect of our work is the knowledge of more than 1
million crowdsourced annotations $\mathcal{C}' = \{c'_0, ..., c'_{M-1}\}$, indicating together
that all signatures $s \in c'_i$ are known to correspond to the same individual $a_i$. In
particular, this data can be used to generate positive pairs $(x = (s_1, s_2), y = 0)$ for all
$s_1, s_2 \in c'_i$, for all $i$. Similarly, negative pairs $(x = (s_1, s_2), y = 1)$ can be
extracted for all $s_1 \in c'_i$, $s_2 \in c'_j$, for all $i \neq j$.
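The generation of labeled pairs from partial clusters can be sketched as follows, with label 0 for pairs from the same individual and 1 otherwise, as in the text. The dictionary layout (author identifier mapped to a list of signature identifiers) and the function name are assumptions for the example.

```python
from itertools import combinations


def labelled_pairs(clusters):
    """Return (s1, s2, y) triples from known clusters.

    clusters maps an author identifier to the list of signature identifiers
    claimed by that author. y = 0 for same author, y = 1 for distinct authors.
    """
    pairs = []
    for members in clusters.values():
        pairs += [(s1, s2, 0) for s1, s2 in combinations(members, 2)]
    for (_, members_i), (_, members_j) in combinations(clusters.items(), 2):
        pairs += [(s1, s2, 1) for s1 in members_i for s2 in members_j]
    return pairs
```

Enumerating every pair is only feasible on small examples; on more than a million signatures, pairs have to be sampled, which is the topic of the sampling strategies discussed after Table 1.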


^1 (n, m) denotes that the TF-IDF vectors are computed from character n, n+1, ..., m-grams.
When not specified, TF-IDF vectors are otherwise computed from words.
Table 1: Input features for learning a linkage function

    Feature                             Combination operator
    Full name                           Cosine similarity of (2,4)-TF-IDF
    Given names                         Cosine similarity of (2,4)-TF-IDF
    First given name                    Jaro-Winkler distance
    Second given name                   Jaro-Winkler distance
    Given name initial                  Equality
    Affiliation                         Cosine similarity of (2,4)-TF-IDF
    Co-authors                          Cosine similarity of TF-IDF
    Title                               Cosine similarity of (2,4)-TF-IDF
    Journal                             Cosine similarity of (2,4)-TF-IDF
    Abstract                            Cosine similarity of TF-IDF
    Keywords                            Cosine similarity of TF-IDF
    Collaborations                      Cosine similarity of TF-IDF
    References                          Cosine similarity of TF-IDF
    Subject                             Cosine similarity of TF-IDF
    Year difference                     Absolute difference

    White                               Product of estimated probabilities
    Black                               Product of estimated probabilities
    American Indian or Alaska Native    Product of estimated probabilities
    Chinese                             Product of estimated probabilities
    Japanese                            Product of estimated probabilities
    Other Asian or Pacific Islander     Product of estimated probabilities
    Others                              Product of estimated probabilities


The most straightforward approach for building a training set on which to learn a linkage
function is to sample an equal number of positive and negative pairs, as suggested above.
By observing that the linkage function $d$ will eventually be used only on pairs of signatures
from the same block $\mathcal{S}_b$, a further refinement for building a training set is to
restrict positive and negative pairs $(s_1, s_2)$ to only those for which $s_1$ and $s_2$
belong to the same block. In doing so, the trained classifier is forced to learn intra-block
discriminative patterns rather than inter-block differences. Furthermore, as noted in (Lange
and Naumann, 2011), most signature pairs are non-ambiguous: if both signatures share the same
author names, then they correspond to the same individual, otherwise they do not. Rather than
sampling pairs uniformly at random, we propose to oversample difficult cases when building the
training set (i.e., pairs of signatures with different author names corresponding to the same
individual, and pairs of signatures with identical author names but corresponding to distinct
individuals) in order to improve the overall accuracy of the linkage function.
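One way to realize blocked and balanced sampling is to bucket intra-block pairs by (same author, same name) and draw the same quota from each bucket, so that the two difficult buckets are not drowned out by the easy ones. The equal-quota rule, the signature layout and the function name below are assumptions; the exact sampling proportions used in the experiments are not specified here.

```python
import random
from itertools import combinations


def balanced_pairs(signatures, n_per_category, seed=0):
    """Sample labeled intra-block pairs, balancing easy and difficult cases.

    signatures: list of dicts with 'block', 'name' and 'author' keys.
    Returns (s1, s2, y) triples with y = 0 for same author, 1 otherwise.
    """
    buckets = {(same_author, same_name): []
               for same_author in (True, False)
               for same_name in (True, False)}
    for s1, s2 in combinations(signatures, 2):
        if s1["block"] != s2["block"]:
            continue  # the linkage function is only ever used within a block
        key = (s1["author"] == s2["author"], s1["name"] == s2["name"])
        buckets[key].append((s1, s2))

    rng = random.Random(seed)
    sample = []
    for (same_author, _), pairs in buckets.items():
        chosen = rng.sample(pairs, min(n_per_category, len(pairs)))
        sample += [(s1, s2, 0 if same_author else 1) for s1, s2 in chosen]
    return sample
```

The buckets (same author, different name) and (different author, same name) hold the difficult cases mentioned in the text.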


3.3 Semi-supervised clustering 


The last component of our author disambiguation pipeline is clustering, that is the process of
grouping together, within a block, all signatures from the same individual (and only those).
As for many other works on author disambiguation, we make use of hierarchical clustering
(Ward Jr, 1963) for building clusters of signatures in a bottom-up fashion. The method
involves iteratively merging together the two most similar clusters until all clusters are
merged together at the top of the hierarchy. Similarity between clusters is evaluated using
either complete, single or average linkage, using as a pseudo-distance metric the probability
that $s_1$ and $s_2$ correspond to distinct authors, as calculated from the custom linkage
function $d$ from Section 3.2.
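The bottom-up procedure can be illustrated with a deliberately naive implementation. It is a sketch for small inputs only (the released code relies on SciPy instead), and it stops merging once the closest pair of clusters is farther apart than a given threshold, which for single, complete and average linkage is equivalent to cutting the full hierarchy at that distance. The function name and signature are hypothetical.

```python
def agglomerative(items, d, threshold, linkage="average"):
    """Naive agglomerative clustering of items under pseudo-distance d.

    Repeatedly merges the two closest clusters until their distance
    exceeds threshold. Returns a list of clusters (lists of items).
    """
    combine = {
        "single": min,
        "complete": max,
        "average": lambda values: sum(values) / len(values),
    }[linkage]

    def cluster_distance(a, b):
        return combine([d(x, y) for x in a for y in b])

    clusters = [[item] for item in items]
    while len(clusters) > 1:
        (i, j), best = min(
            (((i, j), cluster_distance(clusters[i], clusters[j]))
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda candidate: candidate[1],
        )
        if best > threshold:
            break
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters
```

In the pipeline, `d` would be the learned linkage function applied to two signatures of the same block; any symmetric function returning values in [0, 1] can be plugged in to try the sketch.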


To form flat clusters from the hierarchy, one must decide on a maximum distance threshold
above which clusters are considered to correspond to distinct authors. Let us denote by
$\mathcal{S}' = \{s \mid s \in c', c' \in \mathcal{C}'\}$ the set of all signatures for which
partial clusters are known. Let us also denote by $\widehat{\mathcal{C}}$ the predicted
clusters for all signatures in $\mathcal{S}$, and by
$\widehat{\mathcal{C}}' = \{\hat{c} \cap \mathcal{S}' \mid \hat{c} \in \widehat{\mathcal{C}}\}$
the predicted clusters restricted to signatures for which partial clusters are known. From
these, we evaluate the following semi-supervised cut-off strategies, as illustrated in Figure 3:
• No cut: all signatures from the same block are assumed to be from the same author.

• Global cut: the threshold is chosen globally over all blocks, as the one maximizing
some score $f(\mathcal{C}', \widehat{\mathcal{C}}')$.

• Block cut: the threshold is chosen locally at each block $b$, as the one maximizing
some score $f(\mathcal{C}'_b, \widehat{\mathcal{C}}'_b)$. In case $\mathcal{C}'_b$ is empty,
then all signatures from $b$ are clustered together.

Figure 3: Semi-supervised cut-off strategies to form flat clusters of signatures.
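The block cut strategy can be sketched as a search over candidate thresholds. In the sketch below, `cluster_at` (a callable returning the predicted clusters of the block for a given threshold) and `score` (any clustering score comparing true and predicted clusters, for instance an F-measure) are placeholders supplied by the caller; both names, and the convention of returning `None` when no known cluster is available, are assumptions.

```python
def best_cut(thresholds, cluster_at, score, known_true):
    """Pick the threshold maximizing the score on known signatures only.

    thresholds: candidate distance thresholds.
    cluster_at: callable, threshold -> predicted clusters of the block.
    score: callable, (true clusters, predicted clusters) -> float.
    known_true: list of sets, the partially known true clusters of the block.
    Returns None when nothing is known, in which case the caller would
    cluster all signatures of the block together.
    """
    known = set().union(*known_true)
    if not known:
        return None

    def restricted(clusters):
        # Keep only signatures with a known cluster, drop empty clusters.
        return [set(c) & known for c in clusters if set(c) & known]

    return max(thresholds,
               key=lambda t: score(known_true, restricted(cluster_at(t))))
```

A global cut would use the same search, but with the score evaluated once over the known clusters of all blocks rather than block by block.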


4 Implementation 


As part of this work, we developed a stand-alone application for author disambiguation,
publicly available online^2 for free reuse or study. Our implementation builds upon the
Python scientific stack, making use of the Scikit-Learn library (Pedregosa et al., 2011) for
the supervised learning of a linkage function and of SciPy (Jones et al., 2001) for clustering.
All components of the disambiguation pipeline have been designed to follow the Scikit-Learn
API (Buitinck et al., 2013), making them easy to maintain, understand and reuse.
Our implementation is made to be efficient, exploiting parallelization when available, and
ready for production environments. It is also designed to be runnable in an incremental
fashion, by enabling disambiguation only on specified blocks if desired, instead of having to
run the disambiguation process on the whole signature set.


5 Experiments 


5.1 Data 


The author disambiguation solution proposed in this work, along with its enhancements,
is evaluated on data extracted from the INSPIRE portal (Gentil-Beccot et al., 2009), a
digital library for scientific literature in high-energy physics. Overall, the portal holds more
than 1 million publications $\mathcal{P}$, forming in total a set $\mathcal{S}$ of more than
10 million signatures. Out of these, around 13% have been claimed by their original authors,
marked as such by professional curators or automatically assigned to their true authors thanks
to persistent identifiers provided by publishers or other sources. Together, they constitute a
trusted set $(\mathcal{S}', \mathcal{C}')$ of 15388 distinct individuals sharing 36340 unique
author names spread within 1201763 signatures on 360066 publications. This data covers several
decades in time and dozens of author nationalities worldwide.

Following the INSPIRE terms of use, the signatures $\mathcal{S}'$ and their corresponding
clusters $\mathcal{C}'$ are released online^3 under the CC0 license. To the best of our
knowledge, data of this size and coverage is the first to be publicly released in the scope of
author disambiguation research.


^2 https://github.com/glouppe/beard

^3 https://github.com/glouppe/paper-author-disambiguation


5.2 Evaluation protocol 


Experiments carried out to study the impact of the proposed algorithmic components and
refinements, as described in Section 3, follow a standard 3-fold cross-validation protocol,
using $(\mathcal{S}', \mathcal{C}')$ as ground-truth dataset. To replicate the
$|\mathcal{S}'| / |\mathcal{S}| \approx 13\%$ ratio of claimed signatures with respect to the
total set of signatures, as on the INSPIRE platform, cross-validation folds are constructed by
sampling 13% of claimed signatures to form a training set
$\mathcal{S}'_{train} \subseteq \mathcal{S}'$. The remaining signatures
$\mathcal{S}'_{test} = \mathcal{S}' \setminus \mathcal{S}'_{train}$ are used for testing.
Therefore, $\mathcal{C}'_{train} = \{c' \cap \mathcal{S}'_{train} \mid c' \in \mathcal{C}'\}$
represents the partial known clusters on the training fold, while $\mathcal{C}'_{test}$ are
those used for testing.


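As commonly performed in author disambiguation research, we evaluate the predicted
clusters over testing data using both B3 and pairwise precision, recall and F-measure, as
defined below:

$$P_{B3}(\mathcal{C}, \widehat{\mathcal{C}}, \mathcal{S}) = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \frac{|c(s) \cap \hat{c}(s)|}{|\hat{c}(s)|} \qquad (1)$$

$$R_{B3}(\mathcal{C}, \widehat{\mathcal{C}}, \mathcal{S}) = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \frac{|c(s) \cap \hat{c}(s)|}{|c(s)|} \qquad (2)$$

$$F_{B3}(\mathcal{C}, \widehat{\mathcal{C}}, \mathcal{S}) = \frac{2 P_{B3}(\mathcal{C}, \widehat{\mathcal{C}}, \mathcal{S}) R_{B3}(\mathcal{C}, \widehat{\mathcal{C}}, \mathcal{S})}{P_{B3}(\mathcal{C}, \widehat{\mathcal{C}}, \mathcal{S}) + R_{B3}(\mathcal{C}, \widehat{\mathcal{C}}, \mathcal{S})} \qquad (3)$$

$$P_{pairwise}(\mathcal{C}, \widehat{\mathcal{C}}) = \frac{|p(\mathcal{C}) \cap p(\widehat{\mathcal{C}})|}{|p(\widehat{\mathcal{C}})|} \qquad (4)$$

$$R_{pairwise}(\mathcal{C}, \widehat{\mathcal{C}}) = \frac{|p(\mathcal{C}) \cap p(\widehat{\mathcal{C}})|}{|p(\mathcal{C})|} \qquad (5)$$

$$F_{pairwise}(\mathcal{C}, \widehat{\mathcal{C}}) = \frac{2 P_{pairwise}(\mathcal{C}, \widehat{\mathcal{C}}) R_{pairwise}(\mathcal{C}, \widehat{\mathcal{C}})}{P_{pairwise}(\mathcal{C}, \widehat{\mathcal{C}}) + R_{pairwise}(\mathcal{C}, \widehat{\mathcal{C}})} \qquad (6)$$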

where $c(s)$ (resp. $\hat{c}(s)$) is the cluster $c \in \mathcal{C}$ such that $s \in c$ (resp.
the cluster $\hat{c} \in \widehat{\mathcal{C}}$ such that $s \in \hat{c}$), and where
$p(\mathcal{C}) = \cup_{c \in \mathcal{C}} \{(s_1, s_2) \mid s_1, s_2 \in c, s_1 \neq s_2\}$ is
the set of all pairs of signatures from the same clusters in $\mathcal{C}$. Precision evaluates
whether signatures are grouped only with signatures from the same true clusters, while recall
measures the extent to which all signatures from the same true clusters are effectively grouped
together. The F-measure is the harmonic mean between these two quantities. In the analysis
below, we rely primarily on the B3 F-measure for discussing results, as the pairwise variant
tends to favor large clusters (because the number of pairs is quadratic with the cluster size),
hence unfairly giving preference to authors with many publications. By contrast, the B3
F-measure weights clusters linearly with respect to their size. General conclusions drawn below
remain however consistent for pairwise F.
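The two families of scores can be computed directly from their definitions, as in the sketch below. Clusters are represented as collections of hashable signature identifiers covering the same signatures on both sides; pairs are counted as unordered, which leaves the ratios unchanged with respect to ordered pairs. Function names and the conventions chosen for empty pair sets are assumptions.

```python
from itertools import combinations


def _cluster_of(clusters):
    """Map each signature to the (frozen) cluster containing it."""
    return {s: frozenset(c) for c in clusters for s in c}


def _f_measure(precision, recall):
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def b3_scores(true_clusters, pred_clusters):
    """B3 precision, recall and F-measure, averaged over signatures."""
    true_of, pred_of = _cluster_of(true_clusters), _cluster_of(pred_clusters)
    signatures = list(true_of)
    precision = sum(len(true_of[s] & pred_of[s]) / len(pred_of[s])
                    for s in signatures) / len(signatures)
    recall = sum(len(true_of[s] & pred_of[s]) / len(true_of[s])
                 for s in signatures) / len(signatures)
    return precision, recall, _f_measure(precision, recall)


def _pairs(clusters):
    """All unordered pairs of signatures sharing a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(c, 2)}


def pairwise_scores(true_clusters, pred_clusters):
    """Pairwise precision, recall and F-measure."""
    true_pairs, pred_pairs = _pairs(true_clusters), _pairs(pred_clusters)
    common = len(true_pairs & pred_pairs)
    precision = common / len(pred_pairs) if pred_pairs else 1.0
    recall = common / len(true_pairs) if true_pairs else 1.0
    return precision, recall, _f_measure(precision, recall)
```

On a toy example with true clusters {a, b, c}, {d} and predicted clusters {a, b}, {c, d}, a single misplaced signature costs more under the pairwise scores than under B3, which illustrates the quadratic weighting discussed above.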


5.3 Results and discussion 

Baseline. Results presented in Table 2 are discussed with respect to a baseline solution
using the following combination of components:

• Blocking: same surname and the same first given name initial strategy (SFI);

• Linkage function: all 22 features defined in Table 1, gradient boosted regression trees
as supervised learning algorithm and a training set of pairs built from
$(\mathcal{S}'_{train}, \mathcal{C}'_{train})$, by balancing easy and difficult cases;

• Clustering: agglomerative clustering using average linkage and block cuts found to
maximize $F_{B3}(\mathcal{C}'_{train}, \widehat{\mathcal{C}}'_{train})$.

Blocking. The good precision of the baseline (0.9901), but its lower recall (0.9760) suggest
that the blocking strategy might be the limiting factor to further overall improvements. As
shown in Table 3, the maximum recall (i.e., if within a block, all signatures were clustered
optimally) for SFI is 0.9828. At the price of fewer and therefore slightly larger blocks (as
reported in the right column of Table 3), the proposed phonetic-based blocking strategies
show better maximum recall (all around 0.9905), thereby pushing further the upper bound
on the maximum performance of author disambiguation. Let us remind however that the
reported maximum recalls for the blocking strategies using phonetization are also raised due
to the better handling of multiple surnames, as described in Section 3.1.


Table 2: Average precision, recall and F-measure scores on test folds.

                                                      B3                      Pairwise
    Description                                P       R       F        P       R       F
    Baseline                                0.9901  0.9760  0.9830   0.9948  0.9738  0.9842
    Blocking = SFI                          0.9901  0.9760  0.9830   0.9948  0.9738  0.9842
    Blocking = Double metaphone             0.9856  0.9827  0.9841   0.9927  0.9817  0.9871
    Blocking = NYSIIS                       0.9875  0.9826  0.9850   0.9936  0.9814  0.9875
    Blocking = Soundex                      0.9886  0.9745  0.9815   0.9935  0.9725  0.9828
    No name normalization                   0.9887  0.9697  0.9791   0.9931  0.9658  0.9793
    Name normalization                      0.9901  0.9760  0.9830   0.9948  0.9738  0.9842
    Classifier = GBRT                       0.9901  0.9760  0.9830   0.9948  0.9738  0.9842
    Classifier = Random Forests             0.9909  0.9783  0.9846   0.9957  0.9752  0.9854
    Classifier = Linear Regression          0.9749  0.9584  0.9666   0.9717  0.9569  0.9643
    Training pairs = Non-blocked, uniform   0.9793  0.9630  0.9711   0.9756  0.9629  0.9692
    Training pairs = Blocked, uniform       0.9854  0.9720  0.9786   0.9850  0.9707  0.9778
    Training pairs = Blocked, balanced      0.9901  0.9760  0.9830   0.9948  0.9738  0.9842
    Clustering = Average linkage            0.9901  0.9760  0.9830   0.9948  0.9738  0.9842
    Clustering = Single linkage             0.9741  0.9603  0.9671   0.9543  0.9626  0.9584
    Clustering = Complete linkage           0.9862  0.9709  0.9785   0.9920  0.9688  0.9803
    No cut                                  0.9024  0.9828  0.9409   0.8298  0.9776  0.8977
    Global cut                              0.9892  0.9737  0.9814   0.9940  0.9727  0.9832
    Block cut                               0.9901  0.9760  0.9830   0.9948  0.9738  0.9842
    Combined best settings                  0.9888  0.9848  0.9868   0.9951  0.9831  0.9890


Table 3: Maximum B3 and pairwise recall of blocking strategies, and their number of
blocks on $\mathcal{S}'$.

    Blocking            Max. R (B3)   Max. R (pairwise)   # blocks
    SFI                    0.9828          0.9776           12978
    Double metaphone       0.9907          0.9863            9753
    NYSIIS                 0.9902          0.9861           10857
    Soundex                0.9906          0.9863            9403


As Table 2 shows, switching to either Double Metaphone or NYSIIS phonetic-based blocking
improves the overall F-measure score, trading precision for recall. In particular,
NYSIIS-based phonetic blocking is the most effective when applied to the baseline (with an
F-measure of 0.9850) while also being the most efficient computationally among the phonetic
strategies (10857 blocks, versus 9753 for Double Metaphone and 9403 for Soundex; the SFI
baseline yields 12978 blocks).


Finally, let us also note that Table 3 corroborates the estimation of (Torvik and Smalheiser,
2009), stating that SFI blocking has a recall around 98% on real data.


Name normalization. As discussed previously, the seemingly insignificant step of normalizing
author names (stripping accents, removing affixes), as performed in the baseline, is shown
to be important. Results from Table 2 clearly suggest that not normalizing significantly
reduces performance (an F-measure of 0.9830 when normalizing, decreasing to
0.9791 when raw author name strings are used instead).
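The normalization step can be sketched as follows, assuming it amounts to Unicode decomposition, removal of combining accents, lower-casing and dropping a small list of affixes. The affix list and the function name are hypothetical; the exact rules used in the baseline are not reproduced here.

```python
import re
import unicodedata

# Hypothetical affix list, for illustration only.
AFFIXES = {"jr", "sr", "ii", "iii", "iv", "dr", "prof"}


def normalize_name(name):
    """Strip accents, lower-case, drop punctuation and known affixes."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Remove combining marks left by the decomposition (e.g. the umlaut dots).
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Keep runs of letters only; letters without a decomposition are kept as-is.
    tokens = re.findall(r"[^\W\d_]+", unaccented.lower())
    return " ".join(t for t in tokens if t not in AFFIXES)


normalized = normalize_name("Müller, Jörg Jr.")
```

With such a function, "Müller, Jörg Jr." and "Muller, Jorg" map to the same string, so they can be compared or blocked consistently.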


Linkage function. Let us first comment on the results regarding the supervised algorithm
used to learn the linkage function. As Table 2 indicates, both tree-based algorithms appear to
be significantly better fit than Linear Regression (0.9830 and 0.9846 for GBRT and Random
Forests versus 0.9666 for Linear Regression). This result is consistent with (Treeratpituk
and Giles, 2009), which evaluated the use of Random Forests for author disambiguation, but
contradicts the results of (Levin et al., 2012), for which Logistic Regression appeared to be the
best classifier. Provided hyper-parameters are properly tuned, the superiority of tree-based
methods is in our opinion not surprising. Indeed, given that the optimal linkage
function is likely to be non-linear, non-parametric methods are expected to yield better
results, as the experiments here confirm.


Second, properly constructing a training set of positive and negative pairs of signatures from
which to learn a linkage function yields a significant improvement. A random sampling of
positive and negative pairs, without taking blocking into account, significantly degrades
the overall performance (0.9711). When pairs are drawn only from blocks, performance
increases (0.9786), which confirms our intuition that d should be built only from pairs it
will eventually be used to cluster. Finally, making the classification problem more difficult by
oversampling complex cases proves to be relevant, further improving the disambiguation
results (0.9830).
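The sampling strategy can be sketched as below. This is a minimal illustration under two assumptions that are not confirmed by the text above: that "complex cases" are same-author pairs with different name strings and different-author pairs with identical name strings, and that balancing means drawing the same number of pairs from each (same author?, same name?) group. All names and the toy data are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def blocked_balanced_pairs(signatures, block_key, n_per_group, seed=0):
    """Sample labelled training pairs from within blocks only.

    signatures: list of (name, author_id) tuples.
    Returns a list of (sig_a, sig_b, label), label 1 for same author.
    """
    blocks = defaultdict(list)
    for sig in signatures:
        blocks[block_key(sig[0])].append(sig)

    # Group candidate pairs by (same author?, same name string?).
    groups = defaultdict(list)
    for members in blocks.values():
        for a, b in combinations(members, 2):
            groups[(a[1] == b[1], a[0] == b[0])].append((a, b))

    rng = random.Random(seed)
    sample = []
    for key in sorted(groups):
        # Sampling with replacement oversamples the rare (hard) groups.
        drawn = rng.choices(groups[key], k=n_per_group)
        sample += [(a, b, int(key[0])) for a, b in drawn]
    return sample


def toy_block_key(name):
    """Hypothetical SFI-like key for names written as 'surname, given'."""
    surname, given = name.split(", ")
    return surname + given[0]


toy_signatures = [
    ("smith, j", 1), ("smith, john", 1), ("smith, j", 2),
    ("smith, jane", 2), ("doe, a", 3), ("doe, a", 3),
]
toy_pairs = blocked_balanced_pairs(toy_signatures, toy_block_key, n_per_group=5)
```

Because pairs are only formed inside blocks, the classifier is trained on the same kind of pairs it later has to score during clustering.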


Using Recursive Feature Elimination (Guyon et al., 2002), we next evaluate the usefulness of
all fifteen standard and seven additional ethnicity features for learning the linkage function.
The analysis consists in first running the baseline algorithm with all twenty-two features,
determining the least discriminative one from feature importances (Louppe et al., 2013), and then
re-learning the baseline using all but that one feature. That process is repeated recursively
until only one feature remains. Results are presented in Figure 4 for one of the
three folds, starting from the far right with the baseline, where Second given name is the
least important feature, and ending on the left with all features eliminated but Chinese. As
the figure illustrates, the most important features are the ethnicity-based features (Chinese, Other
Asian, Black) along with Co-authors, Affiliation and Full name. Adding the remaining
features only brings marginal improvements, with Journal, Abstract, Collaborations,
References, Given name initial and Second given name being almost insignificant. Overall,
these results highlight the added value of the proposed ethnicity features. Their duality
in modeling both the similarity between author names and their origins makes them very
strong predictors for author disambiguation. The results also corroborate those from (Kang
et al., 2009) or (Ferreira et al., 2010), who found that the similarity between co-authors was
a highly discriminative feature. If computational complexity is a concern, this analysis also
[...] (Treeratpituk and Giles, 2009) or (Levin et al., 2012).
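The elimination procedure described above can be sketched as a simple loop. This is a minimal sketch, assuming a user-supplied `fit` callable that trains the pipeline on a feature subset and returns a score together with per-feature importances; the callable, the generic feature names and the toy importances are hypothetical and carry no experimental meaning.

```python
def recursive_feature_elimination(features, fit):
    """Repeatedly drop the least important feature until one remains.

    fit(features) must return (score, {feature: importance}).
    Returns the history as a list of (feature_subset, score) tuples,
    from the full feature set down to a single feature.
    """
    remaining = list(features)
    history = []
    while remaining:
        score, importances = fit(remaining)
        history.append((tuple(remaining), score))
        if len(remaining) == 1:
            break
        weakest = min(remaining, key=lambda f: importances[f])
        remaining.remove(weakest)
    return history


# Toy stand-in for training: fixed, made-up importances.
TOY_IMPORTANCE = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}


def toy_fit(features):
    """Pretend score: the summed toy importance of the kept features."""
    return sum(TOY_IMPORTANCE[f] for f in features), TOY_IMPORTANCE


toy_history = recursive_feature_elimination(["a", "b", "c", "d"], toy_fit)
```

Plotting the scores in `toy_history` against the number of remaining features gives a curve of the same kind as the one in the Recursive Feature Elimination figure.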


Semi-supervised clustering. The last part of our experiment concerns the study of
agglomerative clustering and the best way to find a cut-off threshold to form clusters. Results from
Table 2 first clearly indicate that average linkage is significantly better than both single and
complete linkage.


Clustering together all signatures from the same block is the least effective strategy (0.9409),
but nevertheless yields surprisingly decent accuracy, given that it requires almost no
computation (i.e., both learning a linkage function and running agglomerative clustering can be
skipped; only the blocking function is needed to group signatures). In particular, this result
reveals that author names are not ambiguous in most cases^1 and that only a small fraction
of them requires advanced disambiguation procedures. On the other hand, both global and
block cut thresholding strategies give very good results, with a slight advantage for block
cuts (0.9830 versus 0.9814 for global cuts), as expected. In case S'_b is empty (e.g., because it
corresponds to a young researcher at the beginning of their career), this therefore suggests that either using
a cut-off threshold learned globally from the known data or using SFI would in general give
satisfactory results, only marginally worse than if claimed signatures had been known.
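The block cut strategy can be illustrated with the following minimal sketch: average-linkage agglomerative clustering is run inside one block for several candidate thresholds, and the threshold giving the best score on the signatures with known authors is kept. Scoring with the pairwise F-measure, the candidate grid and the toy distances are assumptions made for brevity; this is not the implementation used in the experiments.

```python
from itertools import combinations


def average_linkage(items, dist, threshold):
    """Merge the two closest clusters until their average distance exceeds threshold."""
    clusters = [[item] for item in items]

    def avg(c1, c2):
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: avg(clusters[ij[0]], clusters[ij[1]]))
        if avg(clusters[i], clusters[j]) > threshold:
            break
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


def pairwise_f(clusters, truth):
    """Pairwise F-measure restricted to items whose author is known in `truth`."""
    label = {x: k for k, cluster in enumerate(clusters) for x in cluster}
    known = [x for x in label if x in truth]
    tp = fp = fn = 0
    for a, b in combinations(known, 2):
        pred, gold = label[a] == label[b], truth[a] == truth[b]
        tp += pred and gold
        fp += pred and not gold
        fn += gold and not pred
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def block_cut(items, dist, truth, candidates):
    """Pick the candidate threshold maximizing F on the block's claimed signatures."""
    return max(candidates,
               key=lambda t: pairwise_f(average_linkage(items, dist, t), truth))


# Toy block: signatures placed on a line, distance = absolute difference.
pos = {"s1": 0.0, "s2": 0.1, "s3": 0.2, "s4": 1.0, "s5": 1.1}
toy_dist = lambda a, b: abs(pos[a] - pos[b])
toy_truth = {"s1": "A", "s3": "A", "s4": "B"}  # only some signatures are claimed
best_t = block_cut(list(pos), toy_dist, toy_truth, candidates=[0.05, 0.3, 2.0])
toy_clusters = average_linkage(list(pos), toy_dist, best_t)
```

A global cut would apply the same selection once over all blocks and reuse the single resulting threshold everywhere, which is the fallback when a block contains no claimed signatures.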

Combined best settings. When all best settings are combined (i.e., Blocking = NYSIIS,
Name normalization, Classifier = Random Forests, Training pairs = blocked and balanced,
Clustering = Average linkage, Block cuts), performance reaches 0.9862, i.e., the best of all
reported results. In particular, this combination exhibits both the high recall of phonetic
blocking based on the NYSIIS algorithm and the high precision of Random Forests.


^1 This holds for the data we extracted, but may in the future, with the rise of non-Western
researchers, be an underestimate of the ambiguous cases.

Figure 4: Recursive Feature Elimination analysis.


6 Conclusions 


In this work, we have revisited and validated the general author disambiguation pipeline
introduced in previous independent research work. The generic approach is composed of three
components, whose design and tuning are all critical to good performance: (i) a blocking
function for pre-clustering signatures and reducing computational complexity, (ii) a linkage
function for identifying signatures with coreferring authors and (iii) the agglomerative
clustering of signatures. Making use of a distinctively large dataset of more than 1 million
crowdsourced annotations, we experimentally study all three components and propose further
improvements. With regard to blocking, we suggest using phonetization of author
names to increase recall while maintaining low computational complexity. For the linkage
function, we introduce ethnicity-sensitive features for the automatic tailoring of
disambiguation to non-Western author names whenever necessary. Finally, we explore semi-supervised
cut-off threshold strategies for agglomerative clustering. For all three components,
experiments show that our refinements yield significantly better author disambiguation
accuracy.


Overall, these results encourage further improvements and research. For blocking, one
of the open challenges is to manage signatures with inconsistent surnames or inconsistent
first given names (cases 4 and 5, as described in Section 3.1) while keeping blocks at
a tractable size. As phonetic algorithms are not yet perfect, another direction for further
work is the design of better phonetization functions, tailored for author disambiguation.
For the linkage function, the good results of the proposed features pave the way for further
research in ethnicity-sensitive author disambiguation. The automatic fitting of the pipeline
to cultures and ethnic groups for which standard author disambiguation is known to be
less effective (e.g., Chinese authors with many homonyms) indeed constitutes a direction of
research with great potential benefits for the concerned scientific communities.


As part of this study, we also publicly release the annotated data extracted from the
INSPIRE platform, on which our experiments are based. To the best of our knowledge, this is
the first data of this size and coverage to be available in author disambiguation research. By
releasing the data publicly, we hope to provide the basis for further research on author
disambiguation and related topics.